1 SAS Foundation Interfaces for Hadoop

  • FILENAME statement (Base SAS)
    • Allows the DATA step to read and write HDFS data files.
  • PROC HADOOP (Base SAS)
    • Copies or moves files between SAS and Hadoop, executes MapReduce and Pig code, and executes Hadoop file system commands to manage files and directories.
  • SQL Pass-Through (SAS/ACCESS Interface to Hadoop)
    • Submits HiveQL queries and other HiveQL statements from SAS directly to Hive for processing; query results are returned to SAS.
  • LIBNAME statement for Hadoop (SAS/ACCESS Interface to Hadoop)
    • Accesses Hive tables as SAS data sets using the SAS programming language; the SAS/ACCESS engine translates SAS language into HiveQL and attempts to convert the processing to Hive before returning results to SAS.

2 SAS In-Memory Analytics Interfaces for Hadoop

2.1 A Hadoop Cluster to Run SAS

2.2 Base SAS: FILENAME for Hadoop

2.3 Base SAS: The Same Process for PROC HADOOP

2.4 SAS/ACCESS: SQL Pass-Through and LIBNAME

2.5 In-Memory Analytics

2.6 Computing Environment

3 Managing Files and Executing Hadoop Commands, Map-Reduce, and Pig

4 Part 1: Introduction to Base SAS Methods for Hadoop

4.1 Base SAS Tools for Hadoop

  • The Hadoop FILENAME statement enables you to do the following:
    • upload local data to Hadoop using the DATA step
    • read data from Hadoop using the DATA step
  • PROC HADOOP enables you to do the following:
    • submit Hadoop file system (HDFS) commands
    • submit MapReduce programs
    • submit Pig scripts

4.2 The Hadoop Config.xml File

  • The FILENAME statement for Hadoop and PROC HADOOP require an option that specifies a Hadoop configuration file (config.xml).
    • The configuration file defines how to connect to Hadoop.
    • The file must be accessible to the SAS client application.
    • A SAS administrator commonly manages this configuration for the SAS users.
    • This file is often referred to as the Hadoop core-site.xml file.
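For example, a fileref can point at the configuration file and be passed to PROC HADOOP through its OPTIONS= option; a minimal sketch, with a hypothetical path and user name:

```sas
/* The path and user name below are hypothetical and site-specific. */
filename hadcfg "/opt/sas/hadoop/config.xml";

proc hadoop options=hadcfg username="student1" verbose;
   hdfs mkdir="/user/student1/test";   /* any HDFS command, once connected */
run;
```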

4.3 Hadoop JAR Files

  • A collection of Hadoop JAR files is also required on the SAS client machine.
    • An environment variable, SAS_HADOOP_JAR_PATH, on the SAS client machine defines the location of the Hadoop JAR files.
    • The Hadoop JAR files must be compatible with the specific Hadoop implementation and can be copied from the Hadoop server.
    • A Hadoop system administrator commonly manages the configuration of the Hadoop JAR files for the SAS users.
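One common way to define the variable is the OPTIONS SET= statement in the SAS session or configuration file; a sketch, with a hypothetical directory:

```sas
/* Point SAS to the directory holding the Hadoop client JAR files
   copied from the cluster. The path shown is hypothetical. */
options set=SAS_HADOOP_JAR_PATH "/opt/sas/hadoopjars";
```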

4.4 Base SAS Interface to Hadoop

5 Part 2: The HADOOP FILENAME Statement and PROC HADOOP

5.1 Business Scenario

  • We want to develop a prototype for a process that uses SAS to orchestrate the following scenario:
    1. Move unstructured text files into the Hadoop file system.
    2. Invoke MapReduce programs developed by Java programmers in order to:
      • read and process the text files to perform various analyses
      • output results as text files in the Hadoop file system
    3. Read the summarized text analysis results back into SAS for further analysis and reporting purposes.

5.1.1 Breakdown 1

  1. Move unstructured text files into the Hadoop file system.
  • PROC HADOOP and the HDFS COPYFROMLOCAL statement

5.1.2 Breakdown 2

  1. Invoke MapReduce programs developed by Java programmers in order to:
    • read and process the text files to perform various analyses
    • output results as text files in the Hadoop file system
  • PROC HADOOP and the MAPREDUCE statement

5.1.3 Breakdown 3

  1. Read the summarized text analysis results back into SAS for further analysis and reporting purposes.
  • FILENAME statement for Hadoop and DATA step

5.2 Business Scenario Pseudocode

proc hadoop...;
  hdfs copyfromlocal="local file" out="hdfs file";
run;

proc hadoop...;
  mapreduce input="hdfs file" output="hdfs outfile" ...;
run;

filename fileref hadoop "hdfs outfile"...;
data somedata;
  infile fileref ...;
  input ...;
run;

5.2.1 Breakdown 1

  1. Move unstructured text files into the Hadoop file system
proc hadoop...;
  hdfs copyfromlocal="local file" out="hdfs file";
run;

5.2.2 Breakdown 2

  1. Input the HDFS file to the MapReduce program and output the results to an HDFS file
proc hadoop...;
  mapreduce input="hdfs file" output="hdfs outfile" ...;
run;

5.2.3 Breakdown 3

  1. Read the MapReduce output with SAS for further processing
filename fileref hadoop "hdfs outfile"...;
data somedata;
  infile fileref ...;
  input ...;
run;

5.3 PROC HADOOP

  • PROC HADOOP submits:
    • Hadoop file system (HDFS) commands
    • MapReduce programs
    • Pig language code
PROC HADOOP <Hadoop-server-option(s)>;
  HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
  MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
  PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
  PROPERTIES <configuration-properties>;
RUN;

5.4 HDFS Statements

HDFS COPYFROMLOCAL="local-file" OUT="output-location" <DELETESOURCE> <OVERWRITE>;

HDFS COPYTOLOCAL="HDFS-file" OUT="output-location" <DELETESOURCE> <OVERWRITE> <KEEPCRC>;

HDFS DELETE="HDFS-file" <NOWARN>;

HDFS MKDIR="HDFS-path";

HDFS RENAME="HDFS-file" OUT="new-name";
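The statements above can be combined in a single PROC HADOOP step; a sketch, assuming the config fileref hadconfg used later in this section and hypothetical HDFS paths:

```sas
proc hadoop options=hadconfg username="&std";
   hdfs mkdir="/user/&std/archive";                    /* create a directory   */
   hdfs rename="/user/&std/data/old.txt"
        out="/user/&std/archive/old.txt";              /* move the file        */
   hdfs copytolocal="/user/&std/archive/old.txt"
        out="/tmp/old.txt" overwrite;                  /* copy it back to SAS  */
   hdfs delete="/user/&std/archive/old.txt" nowarn;    /* delete the HDFS copy */
run;
```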

5.5 Moving a File from SAS to Hadoop

  • This program creates a directory in the Hadoop file system (HDFS) and copies a file from the SAS server to the new HDFS directory.
filename hadconfg "/work/hadoop_config.xml";
proc hadoop options=hadconfg username="&std" verbose;
  hdfs mkdir="/user/&std/data";
  hdfs copyfromlocal="/work/DIACCHAD/data/moby_dick_via_sas.txt" out="/user/&std/data";
run;

5.6 Execute MapReduce Code

MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
proc hadoop options=hadconfg username="&std";
  mapreduce
    input="source-file"
    output="target-file"
    jar="jar file containing MapReduce code"
    outputkey="output key class (in MapReduce code)"
    outputvalue="output value class"
    reduce="reducer class"
    combine="combiner class"
    map="map class";
run;

5.6.1 Breakdown 1

input="source-file"
  • The input file in HDFS the MapReduce program reads
output="target-file"
  • The output file in HDFS the MapReduce program writes to

5.6.2 Breakdown 2

jar="jar file containing MapReduce code"
  • The JAR file containing the MapReduce program and named classes

5.6.3 Breakdown 3

outputkey="output key class (in MapReduce code)"
  • The name of the output key class in dot notation
outputvalue="output value class"
  • The name of the output value class in dot notation

5.6.4 Breakdown 4

reduce="reducer class"
combine="combiner class"
map="map class";
  • The Java classes in the MapReduce program that execute the map, reduce, and combine steps

5.7 MapReduce Example

  • In the demonstration, PROC HADOOP will be used to invoke a MapReduce program that will do the following:
    • read a text file containing free, unstructured text; this file can be distributed across multiple data nodes in Hadoop
    • parse the text into the individual words in each data node
    • count up the number of instances of each unique word found in each data node
    • combine total counts for each unique word across nodes
    • write the results to an HDFS output file
proc hadoop options=hadconfg username="&std" verbose;
  mapreduce
    jar = "<hdfs path>/hadoop-mr1-cdh.jar"
    input = "<hdfs path>/moby_dick_via_sas.txt"
    map = "org.apache.hadoop.examples.WordCount$TokenizerMapper"
    reduce = "org.apache.hadoop.examples.WordCount$IntSumReducer"
    combine = "org.apache.hadoop.examples.WordCount$IntSumReducer"
    outputkey = "org.apache.hadoop.io.Text"
    outputvalue = "org.apache.hadoop.io.IntWritable"
    output = "<hdfs path>/mapoutput"
  ;
run;

5.7.1 Breakdown 1

  1. Read a text file containing free unstructured text
input = "<hdfs path>/moby_dick_via_sas.txt"

5.7.2 Breakdown 2

  1. In parallel in each data node, parse the text into the individual words
map = "org.apache.hadoop.examples.WordCount$TokenizerMapper"

5.7.3 Breakdown 3

  1. Sum the instance counts for each unique word to produce its final total
reduce = "org.apache.hadoop.examples.WordCount$IntSumReducer"

5.7.4 Breakdown 4

  1. Pre-aggregate the counts within each data node (the combiner runs the same summing logic locally), reducing the data sent to the reducers
combine = "org.apache.hadoop.examples.WordCount$IntSumReducer"

5.7.5 Breakdown 5

  1. The output contains each unique word (outputkey) as Text and the number of times it occurred in the input file as Integer
outputkey = "org.apache.hadoop.io.Text"
    outputvalue = "org.apache.hadoop.io.IntWritable"

5.7.6 Breakdown 6

  1. The HDFS location the output is written to
output = "<hdfs path>/mapoutput"

5.8 The FILENAME Statement for Hadoop

  • In SAS, the FILENAME statement associates a fileref with an external file and an output device or access method.
FILENAME fileref <device type or access method> "external file" <options>;
filename in hadoop "Hadoop-file-path" concat cfg=xml-config-file user="&std";

5.8.1 Breakdown 1

in
  • file reference

5.8.2 Breakdown 2

hadoop
  • Access method

5.8.3 Breakdown 3

"Hadoop-file-path"
  • The directory containing the concatenated files to read

5.8.4 Breakdown 4

concat
  • Specifies that every file in the directory defined by the Hadoop file path is read, as if concatenated into one input
  • Note: This option is valid only for reading (not writing) Hadoop files with the FILENAME statement.

5.8.5 Breakdown 5

cfg=xml-config-file
  • Points to the location of the Hadoop configuration file on the machine where SAS is executing.

5.8.6 Breakdown 6

user="&std"
  • The user ID to connect to Hadoop

5.9 Reading a Hadoop File with a DATA Step

filename hadconfg "/work/hadoop_config.xml";

filename mapres hadoop "/user/&std/data/mapoutput" concat cfg=hadconfg user="&std";

data work.commonwords;
  infile mapres dlm="09"x;
  input word $ count;
  ...
run;
  • hadconfg <=====> cfg=hadconfg
  • mapres <=====> infile mapres
  • "09"x is the hex code constant for the tab character

5.9.1 Another Example

filename hadconfg "/work/hadoop_config.xml";
filename orders hadoop "/user/shared/data/custorders.txt" cfg=hadconfg user="&std";

data work.custorders;
  infile orders;
  input @1 customer_id 8. ...;
run;
proc print data=work.custorders;
run;
  • Does the DATA step read a single HDFS file or a concatenated directory?

5.10 Base SAS: FILENAME for Hadoop (Review)

5.11 Reading a Hadoop File with a DATA Step

5.11.1 Step 1

5.11.2 Step 2

5.11.3 Step 3

5.12 Writing a Hadoop File with a DATA Step

filename hadconfg "/work/hadoop_config.xml";
filename out hadoop "/user/&std/data/custord" dir cfg=hadconfg user="&std";

data _null_;
  set work.custorders;
  file out(corders) dlm=",";
  put customer_id
      country
      gender
      birth_date
      product_id
      order_date
      quantity
      costprice_per_unit;
run;
  • dir: specifies that the Hadoop file path is a directory that output files are written into
  • corders: creates a file named corders in the HDFS directory "/user/&std/data/custord"

5.12.1 Step 1

5.12.2 Step 2

5.12.3 Step 3

5.13 Use Ambari to Browse the Hadoop File System

6 Part 3: Executing Pig Code With PROC HADOOP

6.1 Executing Pig Code with PROC HADOOP

PIG CODE=fileref | "external-file"
    PARAMETERS=fileref | "external-file"
    REGISTERJAR="external-file(s)"
;
filename pigcode "/workshop/DIACCHAD/pigcode.txt";

proc hadoop options=hadconfg username="hdfs" verbose;
  pig code=pigcode;
run;
  • pigcode.txt:
A = LOAD '/user/shared/data/custord'
    USING PigStorage (',')
    AS (customer_id, country, gender, birth_date, product_id,
        order_date, quantity, costprice_per_unit);

B = FILTER A BY gender == 'F';

STORE B INTO '/user/shared/data/student1';

6.1.1 Pig Code Breakdown 1

A = LOAD '/user/shared/data/custord'
    USING PigStorage (',')
    AS (customer_id, country, gender, birth_date, product_id,
        order_date, quantity, costprice_per_unit);
  • Load the comma delimited file and name each field

6.1.2 Pig Code Breakdown 2

B = FILTER A BY gender == 'F';
  • Subset the loaded file for records where gender = 'F'

6.1.3 Pig Code Breakdown 3

STORE B INTO '/user/shared/data/student1';
  • Store the results in the Hadoop file system directory indicated

7 Using the SQL Pass-Through Facility

7.1 SAS/ACCESS Interface to Hadoop

7.2 SQL Pass-Through Query Example

proc sql;
  connect to hadoop (server=namenode port=10000 subprotocol=hive2 schema=diacchad user="&std");
    select * from connection to hadoop
      (select employee_name, salary
        from salesstaff
        where emp_hire_date between '2011-01-01' and '2011-12-31'
      );
  disconnect from hadoop;
quit;
  • Note: The query in parentheses is sent directly to Hive and executed there as a HiveQL query.

7.3 Joining Multiple Tables from Different Databases

7.4 Creating a SAS File from Hive Results
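One way to do this is to wrap the pass-through query from the previous example in a CREATE TABLE statement, so the Hive results land in a SAS data set; a sketch:

```sas
proc sql;
   connect to hadoop (server=namenode port=10000
                      subprotocol=hive2 schema=diacchad user="&std");
   /* The inner query runs in Hive; its results populate work.hires2011 */
   create table work.hires2011 as
      select * from connection to hadoop
         (select employee_name, salary
            from salesstaff
            where emp_hire_date between '2011-01-01' and '2011-12-31');
   disconnect from hadoop;
quit;
```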

7.5 Using HiveQL DDL Statements in SAS
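Non-query HiveQL statements, such as DDL, can be sent with the EXECUTE statement of the SQL procedure; a sketch, with a hypothetical table definition:

```sas
proc sql;
   connect to hadoop (server=namenode port=10000
                      subprotocol=hive2 schema=diacchad user="&std");
   /* The statement in parentheses is passed to Hive unchanged */
   execute (create table wordcounts (word string, count int)
            row format delimited fields terminated by '\t') by hadoop;
   disconnect from hadoop;
quit;
```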

8 Using the SAS/ACCESS LIBNAME Engine

9 The LIBNAME Statement (Review)

LIBNAME libref 'SAS-data-library' <options>;

10 The SAS/ACCESS LIBNAME Statement

LIBNAME libref engine-name <connection-options> <LIBNAME-options>;

libname hivedb hadoop server=namenode
        subprotocol=hive2
        port=10000 schema=diacchad
        user=studentX pw=StudentX;

Partial SAS log:
23    libname hivedb hadoop server=namenode
24        subprotocol=hive2
25        port=10000 schema=diacchad
26        user="&std" pw="&stdpw";
NOTE: Libref HIVEDB was successfully assigned as follows:
  Engine: HADOOP
  Physical Name: jdbc:hive2://namenode:10000/diacchad

11 LIBNAME Statement Connection Options

12 Listing of Hive Tables in the Schema
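Once the libref is assigned, standard SAS tools apply; for example, PROC DATASETS lists the Hive tables in the schema:

```sas
proc datasets lib=hivedb;   /* lists the tables in schema diacchad */
quit;
```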

13 Contents of a Hive Table
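For example, PROC CONTENTS describes the columns of a Hive table through the libref (salesstaff is the table queried earlier):

```sas
proc contents data=hivedb.salesstaff;
run;
```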

14 Listing of a Hive Table
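For example, PROC PRINT lists rows from a Hive table through the libref; OBS= limits how much data is pulled back to SAS:

```sas
proc print data=hivedb.salesstaff (obs=10);
run;
```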

15 SAS/ACCESS Engine Implicit Pass-through

16 Implicit versus Explicit SQL Pass-through

17 Optimizing Implicit Pass-through

18 The SASTRACE= SAS System Option

18.1 SASTRACE= SAS System Option 2

19 NOSTSUFFIX SAS System Option

20 SASTRACELOC= SAS System Option
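The three options are typically set together before running the step to be traced; a sketch:

```sas
/* Write the SQL that the SAS/ACCESS engine generates into the SAS log */
options sastrace=',,,d'      /* d = log the generated SQL statements    */
        sastraceloc=saslog   /* route trace output to the SAS log       */
        nostsuffix;          /* suppress the trailing trace suffix text */
```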

21 SASTRACE= Messages

22 Using the MEANS and FREQ Procedures
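When these procedures run against the Hive libref, implicit pass-through may push the summarization into Hive as SQL; a sketch using the salesstaff table from earlier (the job_title column is hypothetical):

```sas
proc means data=hivedb.salesstaff;
   var salary;              /* salary is a column used earlier     */
run;

proc freq data=hivedb.salesstaff;
   tables job_title;        /* job_title is a hypothetical column  */
run;
```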

22.1 Using the MEANS and FREQ Procedures 2

22.2 Using the MEANS and FREQ Procedures 3

23 Supported SAS Language Functions

24 Using a Supported SAS Function
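A sketch of a WHERE clause using a function with a HiveQL equivalent (UPCASE maps to Hive's UPPER), so the filter can be passed down to Hive; the column is from the earlier example:

```sas
proc print data=hivedb.salesstaff;
   where upcase(employee_name) like 'A%';   /* UPCASE passes down as UPPER */
run;
```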

25 Using a Non-Supported Function

26 Using SAS Data Set Options

27 Selected SAS Data Set Options

28 SASDATEFMT= SAS/ACCESS Data Set Option

29 SASDATEFMT= SAS/ACCESS Data Set Option 2

30 Creating a SAS PROC SQL View
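A sketch of a PROC SQL view over the Hive table from the earlier examples; the view stores only the query, so the data stays in Hive until the view is used:

```sas
proc sql;
   create view work.salesstaff_v as
      select employee_name, salary
        from hivedb.salesstaff;
quit;
```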

30.1 Creating a SAS PROC SQL View 2

31 Using a SAS PROC SQL View

32 Ways to Combine Data

33 Passing Joins to Hive

34 Joining Tables from a Single Connection

35 Joining Tables from Multiple Connections

36 SQL Set Operators

37 Stacking Tables Using Set Operators

38 Combining Hive Tables with Other Sources

39 Combining SAS Data Set and Hive Table

40 Copying Data Sets to Hive
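A sketch of one way to copy a SAS data set into Hive: a DATA step that writes through the libref creates a new Hive table (the target table name is hypothetical):

```sas
data hivedb.custorders_hv;   /* creates a Hive table in schema diacchad */
   set work.custorders;      /* the SAS data set read earlier           */
run;
```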